Language Identification Based on High Frequency Approaches

Authors

  • Kheireddine Abainia
  • Siham Ouamour
  • Halim Sayoud
Abstract

This paper addresses automatic language identification of noisy texts, an important task in natural language processing. Several existing works in this field rely on statistical and machine learning approaches for different categories of texts. Unfortunately, most of the proposed methods work well on clean or long texts but often fail when the text is corrupted or too short. In this work, we use a dataset of short texts collected from several discussion forums and containing several types of noise. The dataset covers 32 languages, some of which are quite different from one another while others are very close. We propose two types of methods to identify the language of a text: a term-based method and a character-based method. We also propose two hybrid methods to enhance the performance of these techniques. Experiments show that the proposed hybrid methods achieve good language identification performance on noisy texts.

Keywords: Natural Language Processing; Text Categorization; Automatic Language Identification; Noisy Text; Hybrid Approach.
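The abstract does not spell out how the term-based, character-based, and hybrid methods are computed. As a rough illustration only, the sketch below assumes one simple reading of a high-frequency approach: each language is represented by its most frequent words and characters, a text is scored by how much of it is covered by those sets, and a hybrid score is a weighted mix of the two. The sample sentences, the function names, the cutoff `k`, and the `weight` parameter are all hypothetical and are not taken from the paper.

```python
from collections import Counter

# Toy training samples (hypothetical, for illustration only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat is in the house with the man",
    "fr": "le chat est dans la maison et le chien est sur la table avec un homme et une femme",
}

def tokenize(text):
    """Split lowercased text into words and alphabetic characters."""
    text = text.lower()
    words = text.split()
    chars = [c for c in text if c.isalpha()]
    return words, chars

def top_items(items, k):
    """Return the set of the k most frequent items."""
    return {item for item, _ in Counter(items).most_common(k)}

def build_profile(text, k=20):
    """Build a language profile: high-frequency terms and characters."""
    words, chars = tokenize(text)
    return {"terms": top_items(words, k), "chars": top_items(chars, k)}

def coverage(items, frequent):
    """Fraction of items that belong to the frequent set (0.0 if empty)."""
    if not items:
        return 0.0
    return sum(item in frequent for item in items) / len(items)

def identify(text, profiles, weight=0.5):
    """Pick the language with the best hybrid score.

    weight is the share given to the term-based score; the remainder
    goes to the character-based score (an assumed combination rule).
    """
    words, chars = tokenize(text)

    def hybrid(profile):
        return (weight * coverage(words, profile["terms"])
                + (1 - weight) * coverage(chars, profile["chars"]))

    return max(profiles, key=lambda lang: hybrid(profiles[lang]))

PROFILES = {lang: build_profile(text) for lang, text in SAMPLES.items()}

if __name__ == "__main__":
    print(identify("le chat est sur la table", PROFILES))
```

Setting `weight=1.0` reduces this to a purely term-based decision and `weight=0.0` to a purely character-based one, which is one way to compare the individual methods against a hybrid on short or noisy inputs.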


Similar resources

Comparison of Spectral Methods for Spoken Language Identification

Automatic spoken language identification determines a language from the speech signal. Language identification systems can be divided into two categories: spectral-based methods and phonetic-based methods. In the former, short-time characteristics of the speech spectrum are extracted as a multi-dimensional vector. The statistical model of these features is then obtained for each language. The ...


Design and Implementation of Field Programmable Gate Array Based Baseband Processor for Passive Radio Frequency Identification Tag (TECHNICAL NOTE)

In this paper, an Ultra High Frequency (UHF) base band processor for a passive tag is presented. It proposes a Radio Frequency Identification (RFID) tag digital base band architecture which is compatible with the EPC C C2/ISO18000-6B protocol. Several design approaches such as clock gating technique, clock strobe design and clock management are used. In order to reduce the area Decimal Matrix C...


Identification of High-Frequency Morphosyntactic Structures in Persian-Speaking Children Aged 4-6 Years: A Qualitative Research

Background: Syntax has a high importance among linguistic parameters and the prevalence of syntax deficits is relatively high in children with language disorders. As such, independent examination of syntax in language development is of paramount importance. In this regard, Iranian language pathologists are faced with the lack of standardized tests. The present study aimed to determine the most ...


Collocational Processing in Two Languages: A psycholinguistic comparison of monolinguals and bilinguals

With the renewed interest in the field of second language learning for the knowledge of collocating words, research findings in favour of holistic processing of formulaic language could support the idea that these language units facilitate efficient language processing. This study investigated the difference between processing of a first language (L1) and a second language (L2) of congruent col...


Offline Language-free Writer Identification based on Speeded-up Robust Features

This article proposes offline language-free writer identification based on speeded-up robust features (SURF), which goes through training, enrollment, and identification stages. In all stages, an isotropic Box filter is first used to segment the handwritten text image into word regions (WRs). Then, the SURF descriptors (SUDs) of word region and the corresponding scales and orientations (SOs) are extr...




Publication date: 2014